UMBC High Performance Computing Facility : Monitoring and Controlling Jobs on HPC
This page last changed on Jan 18, 2009 by straha1.
Now that you've learned to compile programs and submit jobs, you need to know how to monitor and delete them. (Make sure you read the QDel section of this page.) The PBS queuing system includes a number of programs for examining the PBS queue and monitoring or controlling your jobs. This page discusses the following topics: canceling a job with QDel, checking the status of your own jobs with QStat, and examining the whole PBS queue with QStat.
For information on submitting jobs using QSub, see the second part of this tutorial or our page on QSub: Using QSub. All three of these commands have manual pages which can be accessed through the UNIX man program:

    man qstat
    man qdel
    man qsub

For detailed information on QSub, QStat and QDel, see Running Jobs on HPC.

QDel: Canceling a Job

Occasionally you might realize you messed up an input parameter, typed the wrong executable name or made some other mistake. Rather than letting your incorrectly-configured job run, you can cancel it using the qdel command:

    qdel 3172.hpc.cl.rs.umbc.edu

where "3172.hpc.cl.rs.umbc.edu" is the job number returned from qsub. (If you forgot your job number, you can use qstat to determine what it is.) QDel can even cancel your job after it has started running. It may take a minute or two for your job to be deleted from the queue. You can use qstat to monitor the progress of the deletion.

QStat: Job Status Information

Examining Your Jobs

Your job might be sitting in the queue for a while before it runs, depending on how many people are using the cluster. You can check the status of your job using qstat:

    qstat 3172.hpc.cl.rs.umbc.edu

where 3172.hpc.cl.rs.umbc.edu should be replaced by whatever job number qsub returned. If your job is in the queue or running, that command should print out a message much like this:

    Job id              Name             User            Time Use S Queue
    ------------------- ---------------- --------------- -------- - -----
    3172.hpc            hello_parallel   straha1                0 R low_priority

where straha1 is replaced by your user name. The R indicates that your job is running. If you see a Q there, then your job is in the queue waiting to run. If qstat gives you this message:

    qstat: Unknown Job Id 3172.hpc.cl.rs.umbc.edu

then your job has either aborted, completed normally or been deleted.
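If you want to check a job's state from a script rather than by eye, the S column can be pulled out of the qstat output. The following is only a minimal sketch and not part of the PBS tools: the function names job_state and describe_state are made up for illustration, and the code assumes the column layout shown above, with two header lines and the state letter in the second-to-last column.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PBS) for reading qstat output.

# job_state: read qstat output on stdin and print the state letter
# (the "S" column) of every job line.
# Assumption: two header lines, state in the second-to-last column.
job_state() {
    awk 'NR > 2 && NF >= 2 { print $(NF-1) }'
}

# describe_state: translate a state letter into a short description.
# Only R and Q are described on this page; anything else is passed through.
describe_state() {
    case "$1" in
        R)  echo "running" ;;
        Q)  echo "queued" ;;
        "") echo "not in the queue (aborted, completed or deleted)" ;;
        *)  echo "other state: $1" ;;
    esac
}

# Example use (replace the job number with the one qsub returned):
#   state=$(qstat 3172.hpc.cl.rs.umbc.edu 2>/dev/null | job_state)
#   describe_state "$state"
```

When qstat reports an unknown job ID, no job line is found, so job_state prints nothing and describe_state reports that the job is no longer in the queue.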
You can get much more detailed information about your job using the -f option to qstat:

    qstat -f 3172.hpc.cl.rs.umbc.edu

which will print out extensive information, including the number of nodes used, the number of processors per node, which nodes were allocated, the queue, and much more.

Examining the PBS Queue

You can see the list of all jobs in the queue by simply typing

    qstat

(without any job number or options) which might produce something like this:

    Job id              Name             User            Time Use S Queue
    ------------------- ---------------- --------------- -------- - -----
    3166.hpc            MPI_DG           gobbert         00:13:18 R low_priority
    3167.hpc            MPI_DG           gobbert         00:41:01 R low_priority
    3168.hpc            MPI_DG           gobbert         01:33:29 R low_priority
    3171.hpc            llcbench         straha1         00:13:24 R low_priority
    3172.hpc            hello_parallel   straha1                0 Q low_priority

You can see details about other people's jobs using the same qstat -f command described in the previous section. If you notice that the cluster is especially busy right now, you may wish to wait before trying to debug a new MPI program; otherwise you might be waiting an hour or more every time you start the program.
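To get a quick feel for how busy the cluster is, you can count the running and queued jobs in the qstat listing. The sketch below is an illustration only: queue_summary is a made-up function name, and the code assumes the same layout as the listing above (two header lines, state letter in the second-to-last column).

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS): summarize a qstat listing.

# queue_summary: read qstat output on stdin and print how many jobs
# are running (R) and how many are waiting in the queue (Q).
# Assumption: two header lines, state in the second-to-last column.
queue_summary() {
    awk 'NR > 2 && NF >= 2 { count[$(NF-1)]++ }
         END { printf "running=%d queued=%d\n", count["R"], count["Q"] }'
}

# Example use:
#   qstat | queue_summary
```

For the sample listing above, this would report four running jobs and one queued job.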
Document generated by Confluence on Mar 31, 2011 15:37